Advances in Dialectal Arabic Speech Recognition: A Study Using Twitter to Improve Egyptian ASR
نویسندگان
چکیده
This paper reports results in building an Egyptian Arabic speech recognition system as an example for under-resourced languages. We investigated different approaches to build the system using 10 hours for training the acoustic model, and results for both grapheme system and phoneme system using MADA. The phoneme-based system shows better results than the grapheme-based system. In this paper, we explore the use of tweets written in dialectal Arabic. Using 880K Egyptian tweets reduced the Out Of Vocabulary (OOV) rate from 15.1% to 3.2% and the WER from 59.6% to 44.7%, a relative gain 25% in WER.
منابع مشابه
Collecting Data for Automatic Speech Recognition Systems in Dialectal Arabic Using Games with a Purpose
Building Automatic Speech Recognition (ASR) systems for spoken languages usually suffer from the problem of limited available transcriptions. Automatic Speech Recognition (ASR) systems require large speech corpora that contain speech and their corresponding transcriptions for training acoustic models. In this paper, we target the Egyptian dialectal Arabic. As other spoken languages, it is mainl...
متن کاملMulti-Reference Evaluation for Dialectal Speech Recognition System: A Study for Egyptian ASR
Dialectal Arabic has no standard orthographic representation. This creates a challenge when evaluating an Automatic Speech Recognition (ASR) system for dialect. Since the reference transcription text can vary widely from one user to another, we propose an innovative approach for evaluating dialectal speech recognition using Multi-References. For each recognized speech segments, we ask five diff...
متن کاملBest Practices for Crowdsourcing Dialectal Arabic Speech Transcription
In this paper, we investigate different approaches in crowdsourcing transcriptions of Dialectal Arabic speech with automatic quality control to ensure good transcription at the source. Since Dialectal Arabic has no standard orthographic representation, it is very challenging to perform quality control. We propose a complete recipe for speech transcription quality control that includes using out...
متن کاملCrowdsource a little to label a lot: labeling a speech corpus of dialectal Arabic
Arabic is a language with great dialectal variety, with Modern Standard Arabic (MSA) being the only standardized dialect. Spoken Arabic is characterized by frequent code-switching between MSA and Dialectal Arabic (DA). DA varieties are typically differentiated by region, but despite their wide-spread usage, they are under-resourced and lack viable corpora and tools necessary for speech recognit...
متن کاملAnalysis of Dialectal Influence in Pan-Arabic ASR
In this paper, we analyze the impact of five Arabic dialects on the front-end and pronunciation dictionary components of an Automatic Speech Recognition (ASR) system. We use ASR’s phonetic decision tree as a diagnostic tool to compare the robustness of MFCC and MLP front-ends to dialectal variations in the speech data and found that MLP Bottle-Neck features are less robust to such variations. W...
متن کامل